Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso

In clinical trials, identification of prognostic and predictive biomarkers has became essential to precision medicine. Prognostic biomarkers can be useful for the prevention of the occurrence of the disease, and predictive biomarkers can be used to identify patients with potential benefit from the treatment. Previous researches were mainly focused on clinical characteristics, and the use of genomic data in such an area is hardly studied. A new method is required to simultaneously select prognostic and predictive biomarkers in high dimensional genomic data where biomarkers are highly correlated. We propose a novel approach called PPLasso, that integrates prognostic and predictive effects into one statistical model. PPLasso also takes into account the correlations between biomarkers that can alter the biomarker selection accuracy. Our method consists in transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso. In a comprehensive numerical evaluation, we show that PPLasso outperforms the traditional Lasso and other extensions on both prognostic and predictive biomarker identification in various scenarios. Finally, our method is applied to publicly available transcriptomic and proteomic data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05143-0.


Introduction
With the development of precision medicine, there has been an increasing interest in the discovery of different types of biomarkers. A prognostic biomarker informs about a likely clinical outcome (e.g., disease recurrence, disease progression, death) in the absence of therapy or with a standard therapy that patients are likely to receive, while a predictive biomarker is associated with a response or a lack of response to a specific therapy. Fourati [3] and Clark [8] provided a comprehensive explanation and concrete examples to distinguish prognostic from predictive biomarkers, respectively. During the past decade, prognostic and predictive biomarkers showed their power in the development of precision medicine. Giannos et al. [15] identified ten prognostic gene biomarkers for nonsmall cell lung cancer that can be useful for improving risk prediction and therapeutic strategies. Zhao et al. [38] obtained prognostic biomarkers for 13 cancers by integrating multi-omics data and provided a reference for translational medicine researchers. He et al. [16] identified predictive biomarkers for sorafenib resistance and contributed to the guidance of individualized drug therapy. Yet, correctly identifying such biomarkers remains difficult.
Concerning the biomarker selection, the high dimensionality of genomic data is one of the main challenges as explained in Fan and Li [9]. To identify effective biomarkers in high-dimensional settings, several approaches can be considered including hypothesis-based tests described in [23], wrapper approaches proposed in [25], and penalized approaches such as Lasso designed by [30] among others. Hypothesis-based tests consider each biomarker independently and thus ignore potential correlations between them. Wrapper approaches often show high risk of overfitting and are computationally expensive for high-dimensional data as explained in [27]. More efforts have been devoted to penalized methods given their ability to automatically perform variable selection and coefficient estimation simultaneously as highlighted in [10]. However, Lasso showed some potential drawbacks when biomarkers are highly correlated. Particularly, when the Irrepresentable Condition (IC) proposed by [39] is violated, Lasso can not guarantee to correctly identify true effective biomarkers. In genomic data, biomarkers are usually highly correlated such that this condition can hardly be satisfied, see [34]. Several methods have been proposed to adress this issue. Elastic Net [41] combines the ℓ 1 and ℓ 2 penalties and is particularly effective in tackling correlation issues and can generally outperform Lasso. Adaptive Lasso [42] proposes to assign adaptive weights for penalizing different coefficients in the ℓ 1 penalty, and its oracle property was demonstrated. Wang and Leng [35] proposed the HOLP approach which consists in removing the correlation between the columns of the design matrix; Wang et al. [34] proposed to handle the correlation by assigning similar weights to correlated variables in their approach called Precision Lasso; Zhu et al. [40] proposed to remove the correlations by applying a whitening transformation to the data before using the generalized Lasso criterion designed by [31].
The challenge of finding prognostic biomarkers has been extensively explored with previously introduced methods, however, the discovery of predictive biomarkers has seen much less attention. Limited to binary endpoint, Foster et al. [13] proposed to first predict response probabilities for treatment and use this probability as the response in a classification problem to find effective biomarkers. Tian et al. [29] proposed a new method to detect interaction between the treatment and the biomarkers by modifying the covariates. This method can be implemented on continuous/binary/time-to-event endpoint. Lipkovich et al. [20] proposed a method called SIDES, which adopts a recursive partitioning algorithm for screening treatment-by-biomarker interactions. This method was further improved in [19] by adding another step of preselection on predictive biomarkers based on variable importance. The method was demonstrated with continuous endpoint. Evaluated on time-to-event data, Ternès et al. [28] proposed a framework for identifying biomarker-by-treatment interactions but not specifically in the context of correlated biomarkers. More recently, Sechidis et al. [26] applied approaches coming from information theory for ranking biomarkers on their prognostic/predictive strength.
Their method is applicable only for binary or time-to-event endpoint. Moreover, all of these methods were assessed under the situation where the sample size is relatively large and the number of biomarkers is limited, which is hardly the case for genomic data.
In the literature mentioned above, the authors focused on one of the problematic of identifying prognostic or predictive biomarkers, but rarely on both. Even if predictive biomarkers is of major importance for identifying patients more likely to benefit from a treatment, the prognostic biomarkers is also key in this context. Indeed, the clinical impact of a treatment can be judged only with the knowledge of the prognosis of a patient. It is thus of importance to reliably predict the prognosis of patients to assist treatment counseling [36]. To properly describe the two effects, the experimental treatment should be compared to a standard therapy (or a placebo), and patients receiving different treatments should be randomized. A randomized clinical trial can be ideal for such a study. In this paper, we developed a new method called PPLasso (Prognostic Predictive Lasso) to identify prognostic and at the same time predictive biomarkers in a high dimensional setting with continuous endpoints, as presented in "Methods" section . Extensive numerical experiments are given in "Numerical experiments" section to assess the performance of our approach and to compare it to other methods. PPLasso is also applied on two publicly available transcriptomic and proteomic data in "Application to transcriptomic and proteomic data" section. Finally, we give concluding remarks in "Conclusion" section .

Methods
In this section, we propose a novel approach called PPLasso (Predictive Prognostic Lasso) which consists in writing the identification of predictive and prognostic biomarkers as a variable selection problem in an ANCOVA (Analysis of Covariance) type model mentioned for instance in [12].

Statistical modeling
Let y be a continuous response or endpoint and t 1 , t 2 two treatments. Let also X 1 (resp. X 2 ) denote the design matrix for the n 1 (resp. n 2 ) patients with treatment t 1 (resp. t 2 ), each containing measurements on p candidate biomarkers: To take into account the potential correlation that may exist between the biomarkers in the different treatments, we shall assume that the rows of X 1 (resp. X 2 ) are independent centered Gaussian random vectors with a covariance matrice equal to 1 (resp. 2 ).
To model the link that exists between y and the different types of biomarkers we propose using the following model: (1) where (y i1 , . . . , y in i ) corresponds to the response of patients with treatment t i , i being equal to 1 or 2, with α 1 (resp. α 2 ) corresponding to the effects of treatment t 1 (resp. t 2 ). Moreover, β 1 = (β 11 , β 12 , . . . , β 1p ) ′ (resp. β 2 = (β 21 , β 22 , . . . , β 2p ) ′ ) are the coefficients associated to each of the p biomarkers in treatment t 1 (resp. t 2 ) group, ′ denoting the matrix transposition and ǫ 11 , . . . , ǫ 2n 2 are standard independent Gaussian random variables independent of X 1 and X 2 . When t 1 stands for the standard treatment or placebo, prognostic biomarkers are defined as those having non-zero coefficients in β 1 . According to the definition of prognostic biomarkers, their effect should indeed be demonstrated in the absence of therapy or with a standard therapy that patients are likely to receive. On the other hand, predictive biomarkers are defined as those having non-zero coefficients in β 2 − β 1 because they aim to highlight different effects between two different treatments. Model (2) can be written as: The Lasso penalty is a well-known approach to estimate coefficients with a sparsity enforcing constraint allowing variable selection by estimating some coefficients by zero. It consists in minimizing the following penalized least-squares criterion [30]: where �u� 2 2 = n i=1 u 2 i and �u� 1 = n i=1 |u i | for u = (u 1 , . . . , u n ) . A different sparsity constraint was applied to β 1 and β 2 − β 1 to allow different sparsity levels. Hence we propose to replace the penalty γ 1 in (4) by Thus, a first estimator of γ could be found by minimizing the following criterion with respect to γ: , with Id p denoting the identity matrix of size p and 0 i,j denoting a matrix having i rows and j columns and containing only zeros. However, since the inconsistency of Lasso biomarker selection is originated from the correlations between the biomarkers, we propose to remove the correlation by "whitening" the matrix X . More precisely, we consider X = X −1/2 , where and define −1/2 by replacing in (7)  where γ = � 1/2 γ . The objective function (6) thus becomes:

Estimation of γ
Let us define a first estimator of γ = ( α 1 , α 2 , β ′ 1 , β ′ 2 ) as follows: for each fixed 1 and 2 . To better estimate β 1 and β 2 , a thresholding was applied to Note that the corrections are only performed on β 0 , the estimators α 1 and α 2 were not modified. The choice of K 1 and K 2 will be explained in "Choice of the parameters K 1 , K 2 , M 1 and M 2 " section.
To illustrate the interest of using a thresholding step, we generated a dataset based on Model 3 with parameters described in "Simulation setting" section and p = 500 . Moreover, to simplify the graphical illustrations, we focus on the case where Figure 1 displays the estimation error associated to the estimators of β( ) before and after the thresholding. We can see from this figure that the estimation of β( ) is less biased after the correction. Moreover, we observed that this thresholding strongly improves the final estimation of γ and the variable selection performance of our method.

Estimation of γ
, the estimators of β 1 and β 2 − β 1 can be obtained by β 10 As previously, another thresholding was applied to β 10 and β 20 : for i = 1 or 2, for each fixed 1 and 2 . The biomarkers with non-zero coefficients in ) are considered as prognostic (resp. predictive) biomarkers, where the choice of M 1 and M 2 is explained in in "Choice of the parameters K 1 , K 2 , M 1 and M 2 " section .
To illustrate the benefits of using an additional thresholding step, we used the dataset described in "Estimation of γ " section. Moreover, to simplify the graphical illustrations, we also focus on the case where 1 = 2 = . Additional file 1: Figure S1 displays the number of True Positive (TP) and False Positive (FP) in prognostic and predictive biomarker identification with and without the second thresholding. We For each ( 1 , 2 ) and each K 1 , we computed: ) defined in (10) and in (11). It is displayed in the left part of Fig. 2.
For each 1 , 2 and a given δ ∈ (0, 1) , the parameter K 2 is then chosen as follows for each K 1 : The K 2 associated to each K 1 are displayed with '*' in the left part of Fig. 2. Then K 1 is chosen by using a similar criterion: The values of MSE (K 1 , K 2 ) ( 1 , 2 ) are displayed in the right part of Fig. 2 in the particular case where 1 = 2 = , δ = 0.95 and with the same dataset as the one used in "Estimation of γ " section. K 1 is displayed with a red star. This value of δ will be used in the following sections. However, choosing δ in the range (0.9,0.99) does not have a strong impact on the variable selection performance of our approach.
The parameters M 1 and M 2 are chosen in a similar way except that MSE K 1 , ) defined in (10) and (12). In the following,

Estimation of 1 and 2
As the empirical correlation matrix is known to be a non accurate estimator of when p is larger than n, a new estimator has to be used. Thus, for estimating we adopted a cross-validation based method designed by [5] and implemented in the cvCovEst R package [6]. This method chooses the estimator having the smallest estimation error among several compared methods (sample correlation matrix, POET [11] and Tapering [7] as examples). Since the samples in treatments t 1 and t 2 are assumed to be collected from the same population, 1 and 2 are assumed to be equal.

Choice of the parameters 1 and 2
For the sake of simplicity, we limit ourselves to the case where 1 = 2 = . For choosing we used BIC (Bayesian information criterion) which is widely used in the variable selection field and which consists in minimizing the following criterion with respect to : where n is the total number of samples, MSE( ) = �y − X γ ( )� 2 2 and k( ) is the number of non null coefficients in the OLS estimator γ obtained by re-estimating only the non null components of β 1 and β 2 − β 1 . The values of the BIC criterion as well as those of the MSE obtained from the dataset described in "Estimation of γ " section are displayed in Fig. 3.
Additional file 1: Table S1 provides the True Positive Rate (TPR) and False Positive Rate (FPR) when is chosen either by minimizing the MSE or the BIC criterion for this dataset. We can see from this table that both of them have TPR=1 (all true positives are identified). However, the FPR based on the BIC criterion is smaller than the one obtained by using the MSE.
Note that additional results using two different parameters 1 and 2 in the BIC criterion are provided in "Two parameters 1 and 2 v.s. in the BIC Criterion" section.

Numerical experiments
This section presents a comprehensive numerical study by comparing the performance of our method with other regularized approaches in terms of prognostic and predictive biomarker selection. Besides the Lasso, we also compared with Elastic Net, Adaptive Lasso and WLasso [40] since they also take into account the correlations. For these compared methods, in order to directly estimate prognostic and predictive effects, X and γ in Model (3) were replaced by and γ * = (α 1 , α 2 , β * 1 , β * 2 ) , respectively, where X 1 and X 2 are defined in (1), 0 i,j (resp. 1 i,j ) denotes a matrix having i rows and j columns and containing only zeros (resp. ones). Note that this is the modeling proposed by [21]. The sparsity enforcing constraint was (14) BIC( ) = n log(MSE( )/n) + k( ) log(n), X * = 1 n 1 ,1 0 n 1 ,1 X 1 0 n 1 ,p 0 n 2 ,1 1 n 2 ,1 X 2 X 2 , put on the coefficients β * 1 and β * 2 which boils down to putting a sparsity enforcing constraint on β 1 and β 2 − β 1 .

Simulation setting
All simulated datasets were generated from Model (3) where the n 1 ( n 2 ) rows of X 1 ( X 2 ) are assumed to be independent Gaussian random vectors with a covariance matrix 1 = 2 = bm , and ǫ is a standard Gaussian random vector independent of X 1 and X 2 . We defined bm as: where 11 (resp. 22 ) are the correlation matrix of prognostic (resp. non-prognostic) biomarkers with off-diagonal entries equal to a 1 (resp. a 3 ). Morever, 12 is the correlation matrix between prognostic and non-prognostic variables with entries equal to a 2 . In our simulations (a 1 , a 2 , a 3 ) = (0.3, 0.5, 0.7) , which is a framework proposed by [37]. We checked that the Irrepresentable Condition (IC) of [39] is violated and thus the standard Lasso cannot recover the positions of the null and non null variables. For each dataset we assumed randomized treatment allocation between standard and experimental arm with a 1:1 ratio, i.e. n 1 = n 2 = 50 . We further assume a relative treatment effect of 1 ( α 1 = 0 and α 2 = 1 ). The number of biomarkers p varies from 200 to 2000. The number of active biomarkers was set to 10 (i.e. 5 purely prognostic biomarkers with β 1j = β 2j = b 1 = 1 (j = 1, ..., 5) and 5 biomarkers both prognostic and predictive with β 1j = b 1 and β 2j = b 2 = 2 (j = 6, ..., 10)).

Evaluation criteria
We considered several evaluation criteria to assess the performance of the methods in selecting the prognostic and predictive biomarkers: the TPR prog as the true positive rate (i.e. rate of active biomarkers selected) and FPR prog the false positive rate (i.e. rate of inactive biomarkers selected) of the selection of prognostic biomarkers, and similarly for predictive biomarkers with TPR pred and FPR pred . We further note TPR all and FPR all the criterion of overall selection among all candidate biomarkers regardless their prognostic (15) bm = 11 12 T 12 22

Fig. 3 MSE and BIC for all . The minimizing each criterion is displayed with a vertical line
or predictive effect. The objective of the selection is to maximize the TPR all and minimize the FPR all . All metrics were calculated by averaging the results of 100 replications for each scenario.

Two parameters 1 and 2 v.s. in the BIC Criterion
In this section, we compare the results obtained by choosing 1 = 2 = as the minimizer of the BIC criterion described in (14) with those obtained by choosing the values of 1 and 2 as those minimizing the criterion (16): Different results are presented in Fig. 4. PPLasso (resp. PPLasso ) corresponds to the results of the method by using the true (resp. estimated) matrix bm . For estimating bm , we used the approach explained in "Estimation of 1 and 2 " section . Different choices of parameters are also given: "optimal", " min(bic( )) " and " min(bic( 1 , 2 )) ". The first one uses as a value of the parameters the one maximizing (TPR all − FPR all ) , the second one uses the approach presented in (14) and the last one uses the approach described in (16). We observed that the results with two tuning parameters ( 1 , 2 ) were slightly better than those with a single parameter . However, the gap is very small and almost invisible when p increases. For this reason, we limited ourselves to a single tuning parameter in the following.

Biomarker selection results
In order to compare the performance of our approach to the best performance that could be reached by Elastic Net, Lasso, Adaptive Lasso and WLasso, we used for these methods the "optimal" parameters namely those maximizing (TPR all − FPR all ) . The first three methods were implemented with the glmnet R package, the best parameter α involved in Elastic Net was chosen in the set {0.1, 0.2, . . . , 0.9} . WLasso was implemented with the WLasso R package. The choice of "min(bic)" is only applied to our method and corresponds to a choice of that could be used in practical situations. For ease of presentation, the abbreviation EN (resp. AdLasso) refers to Elastic Net (resp. Adaptive Lasso) in the following. Figure 5 shows the selection performance of PPLasso and other compared methods in the simulation scenario presented in "Simulation setting" section. PPLasso achieved to select all prognostic biomarkers ( TPR prog almost 1) even for large p, with limited false positive prognostic biomarkers selected. As compared to the optimal maximizing (TPR all − FPR all ) , the one selected with the BIC tends to select some false positives (average: 33 ( FPR prog = 0.17 ) for p = 200 and 10 ( FPR prog = 0.005 ) for p = 2000 ). The results obtained from the oracle and estimated bm are comparable. Selection performance of predictive biomarkers is slightly lowered as compared to prognostic biomarkers. Even if the false positive selection is quite similar between prognostic and predictive biomarkers, PPLasso missed some true predictive biomarkers when is selected with the BIC criterion (average TPR pred = 0.98 and 0.80 for oracle and estimated bm , respectively, with p = 2000 ). In this scenario where the IC is violated, PPLasso globally outperforms Lasso, Elastic Net, Adaptive Lasso and WLasso. Thanks to the whitening technique used in WLasso, it achieved higher (16) BIC( 1 , 2 ) = n log(MSE( 1 , 2 )/n) + k( 1 , 2 ) log(n).
selection accuracy than the other three methods. Although Elastic Net showed higher TPR than Lasso and Adaptive Lasso, they all failed in selecting all truly prognostic and predictive biomarkers, and the number of missed active biomarkers increased with the dimension p. For example, for Elastic Net, TPR prog = 0.85 and 0.53, TPR pred = 0.81 and 0.61 for p = 200 and 2000, respectively.

Impact of the correlation matrix
To evaluate the impact of the correlation matrix on the selection performance of the methods, additional scenarios are presented where the IC is satisfied: 1. Compound symmetry structure where all biomarkers are equally correlated with a correlation ρ = 0.5; 2. Independent setting where bm is the identity matrix. For the scenario with compound symmetry structure displayed in Fig. 6, all the methods successfully identified the true prognostic biomarkers ( TPR prog close to 1 even for large p) with limited false positive selection. On the other hand, the compared methods (Lasso, ELastic Net, Adaptive Lasso and WLasso) missed some predictive biomarkers especially when p increases.
On the contrary, PPLasso successfully identified almost all predictive biomarkers with the optimal choice of . Moreover, even when is selected by minimizing the BIC criterion (min(bic)), PPLasso est outperformed Lasso and Adaptive Lasso when p > 500 with relatively stable TPR pred and FPR pred as p increases.
For the independent setting, as displayed in Fig. 7, prognostic biomarkers were globally well identified by all the compared methods with a slightly higher TPR prog for Lasso and ELastic Net as compared to PPLasso but also with a slightly higher FPR prog . With regards to predictive biomarkers, PPLasso using bm (oracle) performed also similarly to the Lasso, which is reasonable since no transformation has been used in PPLasso. On the other hand, even if PPLasso with selected with "min(bic)" performed similarly with PPLasso with optimal for relatively small p, the selection performance is altered for large p and even if the performance is higher than Lasso and Adaptive Lasso, it is smaller than the one of Elastic Net.

Impact of the effect size of active biomarkers
To evaluate the impact of the effect size on biomarker selection performance, the scenario presented in "Simulation setting" section was considered with different values of b 2 : 1.5, 2 and 2.5. Since the effect size of prognostic biomarkers did not change, the comparison focused on predictive biomarkers. As expected, the reduction of the effect size makes the biomarker selection harder, especially for Lasso, Elastic Net and Adaptive Lasso where the predictive biomarker selection is limited when b 2 = 1.5 : for Lasso when p = 2000 , TPR pred = 0.45 (resp. 0.22) for b 2 = 2 (resp. 1.5), see Fig. 5 and Additional file 1: Figure S2. The selection performance of PPLasso when is selected with min(bic) is also reduced by decreasing b 2 , especially when bm is also estimated. Nevertheless, the selection performance of PPLasso remains better than for most of the other compared methods for which the performance displayed are associated to the optimal value of . Surprisingly, WLasso performed better than PPLasso with estimated in this scenario. On the other hand, even with limited effect size, PPLasso with optimal identified all predictive biomarkers with very limited false positive selection. When b 2 was increased to 2.5, the selection performance for all methods is improved and the results for PPLasso with estimated was close to the ones with the optimal as displayed in Additional file 1: Figure S3. As compared with PPLasso, for which the selection performance remained stable as p increased, Lasso, Elastic Net, Adaptive Lasso and WLasso were more impacted by the value of p since the true positive selection decreased as p increased. As an example, for the Lasso, TPR pred =0.95 (resp. 0.65) for p = 200 (resp. 2000).

Impact of the number of predictive biomarkers
The impact of the number of true predictive biomarkers was assessed by increasing the number of predictive biomarkers from 5 to 10 in the scenario presented in "Simulation setting" section. When the number of predictive biomarkers increased, the impact on PPLasso is almost negligible, especially for prognostic biomarker identification. However, for the other methods, we can see from Additional file 1: Figure S4 that it became even harder to identify predictive biomarkers. The impact on WLasso was less obvious, while for the other methods, TPR pred decreased compared to Fig. 5, especially for large p (e.g. TPR pred = 0.12, 0.18, and 0.02 for Lasso, Elastic Net and Adaptive Lasso respectively when p = 2000).

Impact of the dimension of the dataset
In this section, we studied a different sample size: n = 50 with n 1 = n 2 = 25 and a different number of biomarkers: p = 5000.
We can see from Additional file 1: Figure S5 that for p = 5000 , the selection performance of PPLasso is not altered as compared with p = 2000 while the compared methods have more difficulties to identify both prognostic and predictive biomarkers.
When the sample size is smaller (n = 50), we can see from Additional file 1: Figure  S6 that the ability to identify prognostic and predictive biomarkers decreased for all the methods. However, PPLasso still outperformed the others with higher TPR prog and TPR pred and lower FPR prog and FPR pred .

Application to the RV144 clinical trial transcriptomic data
We applied the previously described methods to publicly available transcriptomic data from the RV144 vaccine trial [24]. This trial showed reduced risk of HIV-1 acquisition by 31.2% with vaccination with ALVAC and AIDSVAX as compared to placebo. Transcriptomic profiles of in vitro HIV-1 Env-stimulated peripheral blood mononuclear cells (PBMCs) obtained pre-immunization and 15 days after the immunization (D15) from both 40 vaccinees and 10 placebo recipients were generated to better understand underlying biological mechanisms.
For illustration purpose, the absolute change at D15 in gene mTOR was considered as the continuous endpoint (response). mTOR plays a key role in mTORC1 signaling pathway which has been shown to be associated with risk of HIV-1 acquisition [14,1]. The gene expression has been normalized as in the original publication of [14]. After removing non-annotated genes (LOCxxxx and HS.xxxx), the top 2000 genes with the highest empirical variances were included as candidate biomarkers for prognostic and predictive identification from PPLasso and the compared methods. The penalty parameter for the Lasso and Adaptive Lasso, the parameters and α for Elastic Net were selected through the classical cross-validation approach. For PPLasso, was selected based on the criterion described in "Choice of the parameters 1 and 2 " section .
The estimation of was obtained by comparing several candidate estimators from the cvCovEst R package and by selecting the estimator having the smallest estimation error. In this application, the combination of the sample covariance matrix and a dense target matrix (denseLinearShrinkEst) derived by [18] provides the smallest estimation error. Figure 8 (left) displays the estimated and highlights the strong correlation between the genes. Additional file 1: Table S2 gives details on the compared estimators.
Prognostic and predictive genes selected by PPLasso, Lasso, Elastic Net and Adaptive Lasso are listed in Table 1. The number of genes selected are similar for all the compared methods, except for a slightly higher number of predictive genes selected by PPLasso. Lasso, Elastic Net and Adaptive Lasso selected very similar sets of prognostic and predictive genes. The intersection between PPLasso and others is moderate (2 prognostic genes (SLAMF7 and TNFRSF6B), 3 predictive genes (YTHDC1, MS4A7 and RPL21)).
To have a better overview of the prognostic and predictive genes selected by the different methods and their associated roles, pathway analysis was carried out via the REACTOME tool (https:// react ome. org/), where over-representation analysis (ORA) was performed. ORA is used to determine if a set of genes shares more genes with a pathway than we would expect by chance, evaluated by a p-value. Table S3 (prognostic biomarkers) and Additional file 1: Table S4 (predictive biomarkers) showed the identified pathways with a p-value smaller than 0.01. For prognostic biomarkers, there was no pathway identified by WLasso. Most of the pathways identified by Adaptive Lasso were also identified by Elastic Net. Lasso identified a large number of pathways, but some of them may not be related to HIV. PPLasso identified three pathways (also identified by Elastic Net and Lasso). Interestingly, TNFR2 non-canonical NF-kB pathway that was already identified by Fourati et al. [14], is associated with the risk of HIV acquisition in the placebo group; the implication of regulatory T-cells on HIV-1 has also been widely discussed in the literature (e.g., [17]). For predictive biomarkers identified by different methods, Lasso, Elastic Net, and Adaptive Lasso identified comparable pathways, while PPLasso and WLasso share similar ones and different from the other methods. Among the pathways identified by PPLasso, NOD1/2 Signaling Pathway and Toll-like Receptor Cascades pathway are reported as potential targeted adjuvants for HIV-1 vaccines [22]. In addition, RIPK1-mediated regulated necrosis pathway has also been investigated as targets for HIV-1 protease activity during infection [33].

Application to the NCT01241591 clinical trial proteomic data
Baseline blood samples of 173 samples ( n = 81 and 92) were taken from patients included in a randomized phase 3 clinical trial comparing the efficacy and safety of tofacitinib and etanercept in moderate-to-severe chronic plaque psoriasis (https:// clini caltr ials. gov/ ct2/ show/ NCT01 241591) [2]. From these samples, 92 inflammation-associated proteins and 65 cardiovascular disease-associated proteins were measured. Response to treatment was evaluated on the change in PASI score from baseline to week 12. The aim of this application is to identify potential prognostic (proteins associated with the clinical endpoint under standard therapy: etanercept) and predictive biomarkers (proteins differentially associated with the clinical endpoint between etanercept and tofacitinib, aiming to identify patients more likely to benefit from a specific treatment). Figure 8 (right) displays the estimated and shows positive correlations between the proteins after standardization. Prognostic and predictive proteins selected by different methods are listed in Table 2. Among the identified prognostic proteins, IL-8 (identified by PPLasso and WLasso) and IL-17C (identified by Elastic Net and WLasso) both contribute to the IL-17 pathway of psoriasis pathogenesis mechanism [4]. For predictive proteins,  CSF-1, identified by PPLasso, has been shown to be in the core signatures to predict tofacitinib treatment response developed by Tomalin et al. [32]. Lasso selected no proteins. Adaptive Lasso selected only one predictive protein.

Conclusion
We propose a new method named PPLasso to simultaneously identify prognostic and predictive biomarkers. PPLasso is particularly interesting for dealing with high dimensional omics data when the biomarkers are highly correlated, which is a framework that has not been thoroughly investigated yet. From various numerical studies with or whithout strong correlation between biomarkers, we highlighted the strength of PPLasso in well identifying both prognostic and predictive biomarkers with limited false positive selection. The current method is only dedicated to the analysis of continuous responses through ANCOVA type models. However, it will be the subject of a future work to extend it to other challenging contexts, such as classification or survival analysis.

Additional file 1. Supplementary material.
Author contributions WZ derived the algorithms, developed the package, performed the analysis, and wrote a first version of the manuscript. CL supervised the work, was involved in theoretical investigations, gave important suggestions and participated to the writing of the paper. NT initialized the development of the algorithms, supervised the work and edited the manuscript. All authors read and approved the final manuscript.

Funding
This work was supported by the Association Nationale Recherche Technologie (ANRT).